library(tidyverse)
set.seed(1)
theme_set(theme_minimal())
options(scipen=15, digits = 5)
There are several new packages used in this tutorial. I load each one where it is first needed so you can see which functions come from which package, but here’s the full list:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(corrplot)
## corrplot 0.84 loaded
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(ppcor)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
source("cor.mtest.R")
Here we’re using semi-toy data. Not a real study, but the relationships between the visual features are from real data.
In this fake study, participants saw an image for 1, 2, or 3 seconds (Duration, continuous IV). They had to rate on a scale of 1-7 how much they liked the image (Preference, continuous DV). Each image was previously rated for how natural it was (Natural, continuous IV, 1-7). Towards the end, we’ll add on other image feature variables.
This is not a sample analysis pipeline to copy. Rather, this tutorial shows various tools in R for regression analysis and how to calculate certain values by hand, so you understand where R’s output comes from.
natprefdata <- read_csv("RegressionData1.csv")
## Parsed with column specification:
## cols(
## Subject = col_double(),
## Image = col_double(),
## Duration = col_double(),
## Natural = col_double(),
## Preference = col_double()
## )
First we’ll visualize the data.
ggplot(natprefdata) + aes(x = Natural, y = Preference, color=Duration) + geom_point()
We can also summarize the data to the “by image” level before visualizing, collapsing across Duration.
byImage <- natprefdata %>% group_by(Image) %>% summarise(meanPref=mean(Preference),Natural=mean(Natural))
p1 <- ggplot(byImage) + aes(x = Natural, y = meanPref) +
geom_point() +
coord_cartesian(xlim=c(1,7.5), ylim=c(2,7))
p1
Does Naturalness predict Preference?
Calculating the best-fit line from scratch.
meanX <- mean(byImage$Natural)
meanX
## [1] 4.7
meanY <- mean(byImage$meanPref)
meanY
## [1] 5.0733
byImage2 <- byImage %>%
mutate(minusMeanX = Natural-meanX,
minusMeanY = meanPref-meanY,
productminusmeans = minusMeanX*minusMeanY,
minusMeanXsqrd = (Natural-meanX)^2)
head(byImage2)
## # A tibble: 6 x 7
## Image meanPref Natural minusMeanX minusMeanY productminusmeans minusMeanXsqrd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.79 6.78 2.08 -1.28 -2.66 4.33
## 2 2 4.24 6.81 2.11 -0.833 -1.76 4.45
## 3 3 4.52 6.52 1.82 -0.557 -1.01 3.31
## 4 4 4.54 6.79 2.09 -0.534 -1.12 4.37
## 5 5 4.88 6.67 1.97 -0.192 -0.377 3.88
## 6 6 4.93 6.63 1.93 -0.145 -0.279 3.72
m <- sum(byImage2$productminusmeans)/sum(byImage2$minusMeanXsqrd)
m
## [1] 0.13826
# The best-fit line passes through (meanX, meanY), so rearranging y = mx + b gives b = meanY - m*meanX
b <- meanY - m*meanX
b
## [1] 4.4235
Our best fit line is Preference = 0.14*Naturalness + 4.42. Let’s add that to our plot. This could also be written as Y = 4.42 + 0.14*Naturalness + epsilon.
p1 + geom_abline(slope = m, intercept = b)
What does R give us for the linear regression?
lm1 <- lm(meanPref ~ Natural, data = byImage)
summary(lm1)
##
## Call:
## lm(formula = meanPref ~ Natural, data = byImage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1790 -0.4237 0.0164 0.6326 1.3257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4235 0.2519 17.56 <0.0000000000000002 ***
## Natural 0.1383 0.0492 2.81 0.0067 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.777 on 58 degrees of freedom
## Multiple R-squared: 0.12, Adjusted R-squared: 0.105
## F-statistic: 7.9 on 1 and 58 DF, p-value: 0.00672
What does that output mean?
Call: shows you the formula you used in the regression
Residuals: Quantiles of the residuals
Coefficients: shows the estimates, standard errors, t-values, and p-values for the coefficients (betas) in your model, including the y-intercept (b)
Residual standard error: sqrt(MSE) with n-p degrees of freedom
R-squared: SSR/SST
Adjusted R^2: 1- [(1-R^2)(n-1)/(n-k-1)] where n = obs, k=num variables in model, not constants
F-statistic: MSR/MSE
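These definitions can be checked against summary() directly. Here is a minimal sketch with simulated data standing in for byImage (the generating numbers are made up, not the tutorial's):

```r
# Recomputing summary()'s fit statistics by hand (simulated stand-in data)
set.seed(2)
d <- data.frame(x = runif(60, 1, 7))
d$y <- 4.4 + 0.14 * d$x + rnorm(60, sd = 0.8)

fit <- lm(y ~ x, data = d)
s   <- summary(fit)

n <- nrow(d)   # observations
p <- 2         # parameters (intercept + slope)

SSE <- sum(residuals(fit)^2)        # error sum of squares
SST <- sum((d$y - mean(d$y))^2)     # total sum of squares
SSR <- SST - SSE                    # regression sum of squares

RSE    <- sqrt(SSE / (n - p))                  # residual standard error
Rsq    <- SSR / SST                            # multiple R-squared
AdjRsq <- 1 - (1 - Rsq) * (n - 1) / (n - p)    # k = p - 1 predictors, so n-k-1 = n-p
Fstat  <- (SSR / (p - 1)) / (SSE / (n - p))    # MSR/MSE

# Each matches the corresponding entry in summary(fit):
all.equal(RSE, s$sigma)
all.equal(Rsq, s$r.squared)
all.equal(AdjRsq, s$adj.r.squared)
all.equal(Fstat, unname(s$fstatistic["value"]))
```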
What would happen to the slope and intercept if we used the original trial-level dataset? Their values remain the same, but the SEs and p-values get smaller, since we now have 1,200 observations instead of 60.
lm2 <- lm(Preference ~ Natural, data = natprefdata)
summary(lm2)
##
## Call:
## lm(formula = Preference ~ Natural, data = natprefdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6367 -0.4911 0.0652 0.5976 1.7124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4235 0.0590 74.9 <0.0000000000000002 ***
## Natural 0.1383 0.0115 12.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.814 on 1198 degrees of freedom
## Multiple R-squared: 0.107, Adjusted R-squared: 0.107
## F-statistic: 144 on 1 and 1198 DF, p-value: <0.0000000000000002
Reminder of how to get built-in R best fit line on the plot:
p1 + geom_smooth(method="lm", formula = "y ~ x")
What do the residuals look like?
plot(lm1$fitted.values, lm1$residuals)
Do Naturalness and viewing Duration together predict Preference?
First we will standardize our variables. Remember, you can only directly compare the size of betas when the variables have been standardized. Here we are getting z-scores manually; later in the tutorial you’ll see the function scale() will do this for you.
#Standardizing variables
natprefdata_zscores <- natprefdata %>%
mutate(DurationZScore = (Duration - mean(Duration))/sd(Duration),
NaturalZScore = (Natural - mean(Natural))/sd(Natural),
PreferenceZScore = (Preference - mean(Preference))/sd(Preference))
natprefdata_zscores
## # A tibble: 1,200 x 8
## Subject Image Duration Natural Preference DurationZScore NaturalZScore
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 3 6.78 3.75 1.22 1.02
## 2 1 2 1 6.81 4.58 -1.22 1.03
## 3 1 3 3 6.52 4.42 1.22 0.892
## 4 1 4 1 6.79 4.75 -1.22 1.02
## 5 1 5 3 6.67 4.31 1.22 0.966
## 6 1 6 3 6.63 4.63 1.22 0.946
## 7 1 7 3 6.76 6.69 1.22 1.01
## 8 1 8 1 6.8 5.02 -1.22 1.03
## 9 1 9 2 4.05 5.16 0 -0.319
## 10 1 10 1 6.59 4.93 -1.22 0.926
## # … with 1,190 more rows, and 1 more variable: PreferenceZScore <dbl>
multreg1 <- lm(PreferenceZScore ~ NaturalZScore + DurationZScore, data=natprefdata_zscores)
summary(multreg1)
##
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore + DurationZScore,
## data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1077 -0.5680 0.0625 0.6848 2.0399
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -0.000000000000000299 0.027272888807801814 0.0
## NaturalZScore 0.328832581163430393 0.027299038924182395 12.1
## DurationZScore -0.040860366402140795 0.027299038924182405 -1.5
## Pr(>|t|)
## (Intercept) 1.00
## NaturalZScore <0.0000000000000002 ***
## DurationZScore 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.945 on 1197 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.107
## F-statistic: 73.2 on 2 and 1197 DF, p-value: <0.0000000000000002
#Confidence intervals for the beta
confint.lm(multreg1)
## 2.5 % 97.5 %
## (Intercept) -0.053508 0.053508
## NaturalZScore 0.275273 0.382392
## DurationZScore -0.094420 0.012699
plot(multreg1$fitted.values, multreg1$residuals)
We can look at the residuals visually for their spread and normality, as well as run statistical tests for heteroscedasticity.
plot(multreg1)
hist(multreg1$residuals)
#library(lmtest)
#Breusch-Pagan Test
bptest(multreg1)
##
## studentized Breusch-Pagan test
##
## data: multreg1
## BP = 3.69, df = 2, p-value = 0.16
bptest(multreg1, varformula = ~ fitted.values(multreg1))
##
## studentized Breusch-Pagan test
##
## data: multreg1
## BP = 3.64, df = 1, p-value = 0.056
#library(car)
# Score Test for Non-Constant Error Variance
ncvTest(multreg1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.5867, Df = 1, p = 0.0582
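Under the hood, the studentized (Koenker) statistic that bptest reports is just n times the R-squared from regressing the squared residuals on the auxiliary variables - the model's regressors by default, or the fitted values when you pass varformula. A base-R sketch of that calculation (bp_by_hand is a hypothetical helper run on simulated data); ncvTest instead uses the original, non-studentized score test, hence the slightly different numbers above:

```r
# Studentized Breusch-Pagan by hand: n * R^2 from regressing the squared
# residuals on the auxiliary variables.
set.seed(3)
z1  <- rnorm(200)
z2  <- rnorm(200)
y   <- 1 + z1 - 0.5 * z2 + rnorm(200)   # homoscedastic by construction
fit <- lm(y ~ z1 + z2)

bp_by_hand <- function(fit, aux) {
  aux  <- as.matrix(aux)
  e2   <- residuals(fit)^2
  r2   <- summary(lm(e2 ~ aux))$r.squared
  stat <- length(e2) * r2
  df   <- ncol(aux)
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp_by_hand(fit, cbind(z1, z2))   # analogue of bptest(fit)
bp_by_hand(fit, fitted(fit))     # analogue of bptest(fit, ~ fitted.values(fit))
```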
I think this StackExchange answer provides a clear explanation of what these tests are doing differently: https://stats.stackexchange.com/a/261846
#SSE - sum((actual-predicted)^2)
natprefdata_zscores$Predicted <- multreg1$fitted.values
SSE <- natprefdata_zscores %>%
mutate(error = PreferenceZScore - Predicted,
errorsq = error^2) %>%
summarise(SSE = sum(errorsq)) %>%
pull(SSE)
SSE
## [1] 1068.4
#Another way
SSE1 <- sum((multreg1$residuals)^2)
#SST - sum((actual-mean)^2)
meanPref <- mean(natprefdata_zscores$PreferenceZScore)
SST <- natprefdata_zscores %>%
mutate(difffrommean = PreferenceZScore - meanPref,
diffsq = difffrommean^2) %>%
summarise(SST = sum(diffsq)) %>%
pull(SST)
SST
## [1] 1199
#Another way
SST1 <- sum((natprefdata_zscores$PreferenceZScore - mean(natprefdata_zscores$PreferenceZScore))^2)
#SSR - SST-SSE
SSR = SST-SSE
SSR
## [1] 130.59
# SSR just for naturalness
justnat <- lm(PreferenceZScore ~ NaturalZScore, data=natprefdata_zscores)
Red_SSR_justnat <- sum((justnat$fitted.values - mean(natprefdata_zscores$PreferenceZScore))^2)
Red_SSR_justnat
## [1] 128.59
# SSR just for duration
justduration <- lm(PreferenceZScore ~ DurationZScore, data=natprefdata_zscores)
Red_SSR_justdur <- sum((justduration$fitted.values - mean(natprefdata_zscores$PreferenceZScore))^2)
Red_SSR_justdur
## [1] 1.0821
#Unique SSRs
Uniq_nat_SSR = SSR - Red_SSR_justdur
Uniq_dur_SSR = SSR - Red_SSR_justnat
# Shared SSR
Shared_SSR = SSR - Uniq_dur_SSR - Uniq_nat_SSR
Shared_SSR
## [1] -0.91756
# Shared SSR is essentially zero (here slightly negative) because Duration and Naturalness are barely correlated.
cor(natprefdata_zscores$DurationZScore, natprefdata_zscores$NaturalZScore)
## [1] 0.032901
#R^2 = SSR/SST
Rsq = SSR/SST
Rsq
## [1] 0.10892
# Adj R^2 = 1- [(1-R^2)(n-1)/(n-k-1)] where n = obs, k=num variables in model, not constants
n <- 1200
k <- 2
Adj_Rsq = 1 - ((1-Rsq)*(n-1)/(n-k-1))
Adj_Rsq
## [1] 0.10743
R doesn’t have a built-in function for Cohen’s f-squared, but it’s pretty easy to calculate by hand.
# R^2 /(1 - R^2)
cohen_fsq = Rsq/(1-Rsq)
cohen_fsq
## [1] 0.12223
p <- 3
#MSE = SSE/df_err (df error = n-p)
MSE <- SSE/(n-p)
MSE
## [1] 0.89257
#Note from the 'summary' - Residual standard error was .944; That's the square root of MSE.
sqrt(MSE)
## [1] 0.94476
#MST = SST/df_tot (df total = n-1) n=observations
MST <- SST/(n-1)
MST
## [1] 1
#MSR = SSR/df_reg (df regression = p-1) p=num parameters
MSR <- SSR/(p-1)
MSR
## [1] 65.295
#F = MSR/MSE
F <- MSR/MSE
F
## [1] 73.154
# p-value for F-statistic
pval_F <- pf(F, p-1, n-p, lower.tail = FALSE)
pval_F
## [1] 1.0622e-30
What if each trial couldn’t be used as a continuous DV (hint: like Hit Rate for our pilot study)?
image_cond <- natprefdata_zscores %>%
group_by(Image, Duration) %>%
summarize(meanPref = mean(PreferenceZScore), Natural=mean(NaturalZScore))
image_cond
## # A tibble: 180 x 4
## # Groups: Image [60]
## Image Duration meanPref Natural
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -1.40 1.02
## 2 1 2 -1.43 1.02
## 3 1 3 -1.55 1.02
## 4 2 1 -1.04 1.03
## 5 2 2 -1.08 1.03
## 6 2 3 -0.852 1.03
## 7 3 1 -0.600 0.892
## 8 3 2 -0.647 0.892
## 9 3 3 -0.683 0.892
## 10 4 1 -0.430 1.02
## # … with 170 more rows
#Then we can run our lm on this dataset.
#Adding an interaction term between duration and naturalness
multreg_interaction <- lm(PreferenceZScore ~ NaturalZScore + DurationZScore + NaturalZScore*DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction)
##
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore + DurationZScore +
## NaturalZScore * DurationZScore, data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0878 -0.5664 0.0668 0.6922 2.0160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000497 0.027295 0.02 0.99
## NaturalZScore 0.328803 0.027307 12.04 <0.0000000000000002
## DurationZScore -0.041003 0.027308 -1.50 0.13
## NaturalZScore:DurationZScore -0.015133 0.027025 -0.56 0.58
##
## (Intercept)
## NaturalZScore ***
## DurationZScore
## NaturalZScore:DurationZScore
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.945 on 1196 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.107
## F-statistic: 48.8 on 3 and 1196 DF, p-value: <0.0000000000000002
#You can also do this to get the same output, but it's less explicit about the full model:
multreg_interaction1 <- lm(PreferenceZScore ~ NaturalZScore*DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction1)
##
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore * DurationZScore,
## data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0878 -0.5664 0.0668 0.6922 2.0160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000497 0.027295 0.02 0.99
## NaturalZScore 0.328803 0.027307 12.04 <0.0000000000000002
## DurationZScore -0.041003 0.027308 -1.50 0.13
## NaturalZScore:DurationZScore -0.015133 0.027025 -0.56 0.58
##
## (Intercept)
## NaturalZScore ***
## DurationZScore
## NaturalZScore:DurationZScore
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.945 on 1196 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.107
## F-statistic: 48.8 on 3 and 1196 DF, p-value: <0.0000000000000002
#Or if for some reason you only want the interaction term:
multreg_interaction2 <- lm(PreferenceZScore ~ NaturalZScore:DurationZScore, data=natprefdata_zscores)
summary(multreg_interaction2)
##
## Call:
## lm(formula = PreferenceZScore ~ NaturalZScore:DurationZScore,
## data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.426 -0.623 0.106 0.673 2.258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000509 0.028891 0.02 0.99
## NaturalZScore:DurationZScore -0.015480 0.028604 -0.54 0.59
##
## Residual standard error: 1 on 1198 degrees of freedom
## Multiple R-squared: 0.000244, Adjusted R-squared: -0.00059
## F-statistic: 0.293 on 1 and 1198 DF, p-value: 0.588
#library(lme4)
#Adding random intercepts
multreg_randomimage <- lmer(PreferenceZScore ~ NaturalZScore + DurationZScore + (1|Image), data=natprefdata_zscores)
summary(multreg_randomimage)
## Linear mixed model fit by REML ['lmerMod']
## Formula: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
## Data: natprefdata_zscores
##
## REML criterion at convergence: 1079.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.201 -0.618 -0.006 0.641 3.113
##
## Random effects:
## Groups Name Variance Std.Dev.
## Image (Intercept) 0.808 0.899
## Residual 0.111 0.334
## Number of obs: 1200, groups: Image, 60
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.00000000000000454 0.11644840661170815 0.00
## NaturalZScore 0.32741752345271813 0.11649741308696240 2.81
## DurationZScore 0.00214915488942733 0.00990663998136926 0.22
##
## Correlation of Fixed Effects:
## (Intr) NtrlZS
## NaturalZScr 0.000
## DuratinZScr 0.000 -0.003
multreg_randomimage_AIC <- lmer(PreferenceZScore ~ NaturalZScore + DurationZScore + (1|Image), data=natprefdata_zscores, REML=FALSE)
summary(multreg_randomimage_AIC)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
## Data: natprefdata_zscores
##
## AIC BIC logLik deviance df.resid
## 1077.5 1103.0 -533.8 1067.5 1195
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.202 -0.619 -0.006 0.641 3.115
##
## Random effects:
## Groups Name Variance Std.Dev.
## Image (Intercept) 0.781 0.884
## Residual 0.111 0.334
## Number of obs: 1200, groups: Image, 60
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.00000000000000301 0.11449070165759732 0.00
## NaturalZScore 0.32741786910113940 0.11453889929624642 2.86
## DurationZScore 0.00213864918852664 0.00990223236594340 0.22
##
## Correlation of Fixed Effects:
## (Intr) NtrlZS
## NaturalZScr 0.000
## DuratinZScr 0.000 -0.003
AIC1 <- 1077.5
null_randomimage_AIC <- lmer(PreferenceZScore ~ 1 + (1|Image), data=natprefdata_zscores, REML=FALSE)
summary(null_randomimage_AIC)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: PreferenceZScore ~ 1 + (1 | Image)
## Data: natprefdata_zscores
##
## AIC BIC logLik deviance df.resid
## 1081.2 1096.5 -537.6 1075.2 1197
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.202 -0.620 -0.006 0.645 3.128
##
## Random effects:
## Groups Name Variance Std.Dev.
## Image (Intercept) 0.888 0.942
## Residual 0.111 0.334
## Number of obs: 1200, groups: Image, 60
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.000000000000004 0.122028809503485 0
nullAIC <- 1081.2
#Can also use AIC to compare alternative models to each other, instead of just the null.
(deltaAIC <- AIC1-nullAIC)
## [1] -3.7
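Rather than hard-coding the printed AIC values, AIC() and logLik() pull them straight from a fitted model (the same calls work on lmer fits with REML = FALSE). A sketch with plain lm fits on simulated data:

```r
# AIC = -2*logLik + 2*k, where k counts all estimated parameters
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- 0.3 * d$x + rnorm(100)

full <- lm(y ~ x, data = d)
null <- lm(y ~ 1, data = d)

k      <- attr(logLik(full), "df")   # here 3: intercept, slope, sigma
manual <- -2 * as.numeric(logLik(full)) + 2 * k
all.equal(AIC(full), manual)         # TRUE

(deltaAIC <- AIC(full) - AIC(null))  # negative favors the fuller model
```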
anova(multreg_randomimage_AIC, null_randomimage_AIC)
## Data: natprefdata_zscores
## Models:
## null_randomimage_AIC: PreferenceZScore ~ 1 + (1 | Image)
## multreg_randomimage_AIC: PreferenceZScore ~ NaturalZScore + DurationZScore + (1 | Image)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## null_randomimage_AIC 3 1081 1096 -538 1075
## multreg_randomimage_AIC 5 1078 1103 -534 1068 7.71 2 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(multreg_randomimage_AIC)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 0.745328 1.070095
## .sigma 0.320323 0.347735
## (Intercept) -0.228038 0.228038
## NaturalZScore 0.099283 0.555551
## DurationZScore -0.017287 0.021562
#Changing naturalness to a categorical variable
#Two categories
natprefdata_zscores <- natprefdata_zscores %>%
mutate(NatCategory = case_when(
NaturalZScore > 0 ~ "High",
NaturalZScore < 0 ~ "Low"
))
lm_categorical <- lm(PreferenceZScore ~ NatCategory + DurationZScore, data=natprefdata_zscores)
summary(lm_categorical)
##
## Call:
## lm(formula = PreferenceZScore ~ NatCategory + DurationZScore,
## data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1392 -0.5740 0.0752 0.6685 2.0129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2713 0.0372 7.3 0.00000000000051 ***
## NatCategoryLow -0.6029 0.0554 -10.9 < 0.0000000000000002 ***
## DurationZScore -0.0387 0.0276 -1.4 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.954 on 1197 degrees of freedom
## Multiple R-squared: 0.0909, Adjusted R-squared: 0.0894
## F-statistic: 59.8 on 2 and 1197 DF, p-value: <0.0000000000000002
#Three categories
quantile(natprefdata_zscores$NaturalZScore, probs = c(.33,.66))
## 33% 66%
## -0.70092 0.90189
hist(natprefdata_zscores$NaturalZScore)
#An interesting use of 'case_when'
natprefdata_zscores <- natprefdata_zscores %>%
mutate(NatCategory3 = case_when(
NaturalZScore > .9 ~ "High",
NaturalZScore < -.7 ~ "Low",
TRUE ~ "Medium"
))
natprefdata_zscores$NatCategory3 <- factor(natprefdata_zscores$NatCategory3, levels = c("Low", "Medium", "High"))
lm_categorical3 <- lm(PreferenceZScore ~ NatCategory3 + DurationZScore, data=natprefdata_zscores)
summary(lm_categorical3)
##
## Call:
## lm(formula = PreferenceZScore ~ NatCategory3 + DurationZScore,
## data = natprefdata_zscores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0670 -0.5883 0.0526 0.6912 2.0252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4047 0.0478 -8.46 < 0.0000000000000002 ***
## NatCategory3Medium 0.5180 0.0685 7.56 0.00000000000008 ***
## NatCategory3High 0.6875 0.0669 10.28 < 0.0000000000000002 ***
## DurationZScore -0.0393 0.0276 -1.42 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.956 on 1196 degrees of freedom
## Multiple R-squared: 0.0875, Adjusted R-squared: 0.0853
## F-statistic: 38.3 on 3 and 1196 DF, p-value: <0.0000000000000002
Notice that with a categorical predictor, R’s output doesn’t include all three categories - each beta is interpreted as the difference in the DV between that category and the base (omitted) category. This is how R handles dummy coding: one category is left out and absorbed into the intercept as the reference level.
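A small sketch of how the reference level works (a hypothetical three-level factor, not the tutorial data): relevel() changes which category is omitted, which changes the betas but not the fitted model.

```r
# Dummy coding: the first factor level is absorbed into the intercept;
# each beta is that category's difference from the reference level.
set.seed(5)
cat3 <- factor(sample(c("Low", "Medium", "High"), 300, replace = TRUE),
               levels = c("Low", "Medium", "High"))
means <- c(Low = 0, Medium = 0.5, High = 0.7)
y <- as.numeric(means[as.character(cat3)]) + rnorm(300)

m_low  <- lm(y ~ cat3)                          # reference level = "Low"
m_high <- lm(y ~ relevel(cat3, ref = "High"))   # reference level = "High"

coef(m_low)   # intercept = mean of "Low"; other betas are differences from it
coef(m_high)  # same model, re-expressed relative to "High"

all.equal(fitted(m_low), fitted(m_high))  # identical fits, different coding
```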
ggplot(data=natprefdata_zscores) +
aes(x=Duration, y = Preference, color=NatCategory3) +
geom_jitter(width = .2) +
geom_smooth(method="lm", formula = "y ~ x")
natpref_categorical <- natprefdata_zscores %>%
mutate(Duration_cat = as.factor(Duration)) %>%
dplyr::select(Subject, Image, Preference, NatCategory3, Duration_cat)
lm_categorical_both <- lm(Preference ~ NatCategory3 + Duration_cat, data=natpref_categorical)
summary(lm_categorical_both)
##
## Call:
## lm(formula = Preference ~ NatCategory3 + Duration_cat, data = natpref_categorical)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.652 -0.508 0.043 0.596 1.734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7766 0.0524 91.09 < 0.0000000000000002 ***
## NatCategory3Medium 0.4483 0.0591 7.58 0.000000000000067 ***
## NatCategory3High 0.5922 0.0576 10.28 < 0.0000000000000002 ***
## Duration_cat2 -0.0747 0.0583 -1.28 0.20
## Duration_cat3 -0.0829 0.0583 -1.42 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.824 on 1195 degrees of freedom
## Multiple R-squared: 0.0879, Adjusted R-squared: 0.0848
## F-statistic: 28.8 on 4 and 1195 DF, p-value: <0.0000000000000002
confint(lm_categorical_both)
## 2.5 % 97.5 %
## (Intercept) 4.67371 4.879481
## NatCategory3Medium 0.33230 0.564252
## NatCategory3High 0.47921 0.705258
## Duration_cat2 -0.18909 0.039793
## Duration_cat3 -0.19729 0.031535
#Merging other image features
imagefeatures <- read_csv("RegressionImageData.csv")
## Parsed with column specification:
## cols(
## Image = col_double(),
## EdgeDensity = col_double(),
## Order = col_double(),
## Natural = col_double(),
## Habitable = col_double(),
## Preference = col_double()
## )
#Here we don't want Preference or Natural from the new data because they are already in the current data set, so we drop them. Note: the MASS package (loaded by ppcor) also has a select() function that masks dplyr's, so I'm specifying that I want dplyr's version
imagefeatures <- imagefeatures %>% dplyr::select(-c(Preference, Natural))
#Check out the different join options in help. Joins use a common variable (in this case our Image column) to merge data. Left join is used here because we want to keep all rows from the first data set, and match the corresponding rows from the second data set to them.
#Can also do: natprefdata_2 <- merge(natprefdata, imagefeatures)
natprefdata_2 <- natprefdata %>% left_join(imagefeatures)
## Joining, by = "Image"
#Now we can run the new lm
morevariables_lm <- lm(Preference ~ Natural + Order + Habitable + Duration, data = natprefdata_2)
summary(morevariables_lm)
##
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable + Duration,
## data = natprefdata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0841 -0.3903 -0.0113 0.4393 1.6832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3900 0.1618 2.41 0.016 *
## Natural 0.4091 0.0150 27.31 <0.0000000000000002 ***
## Order 0.2744 0.0235 11.67 <0.0000000000000002 ***
## Habitable 0.3445 0.0192 17.92 <0.0000000000000002 ***
## Duration -0.0198 0.0225 -0.88 0.378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.635 on 1195 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.456
## F-statistic: 253 on 4 and 1195 DF, p-value: <0.0000000000000002
We can check the Variance Inflation Factor for this model: VIF_j = 1/(1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors.
Rules of Thumb:
A VIF of 1 means no correlation with the other predictors; no sign of multicollinearity
A VIF of 4 means 75% of that predictor’s variance is explained by the other predictors – a potential sign of multicollinearity that needs more investigation
A VIF of 10 means 90% of its variance is explained – serious multicollinearity!
#library(car)
vif(morevariables_lm)
## Natural Order Habitable Duration
## 2.7762 1.3467 3.3080 1.0032
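VIF can also be computed by hand straight from its definition. A base-R sketch (vif_by_hand is a hypothetical helper; simulated predictors stand in for the image features):

```r
# VIF by hand: regress each predictor on the others, then 1/(1 - R^2_j)
set.seed(6)
x1 <- rnorm(500)
x2 <- 0.8 * x1 + rnorm(500, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(500)                        # independent of both

vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(data.frame(x1, x2, x3))
# x1 and x2 inflate each other; x3, uncorrelated with both, stays near 1
```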
There are other packages and methods for model selection in R; this is the easiest one to use in my opinion.
#library(MASS) #Already loaded as a required package by something else we are using
# Fit the full model
full.model <- lm(Preference ~., data = natprefdata_2[,4:8])
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable, data = natprefdata_2[,
## 4:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0660 -0.3913 -0.0141 0.4375 1.6624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3471 0.1542 2.25 0.025 *
## Natural 0.4093 0.0150 27.34 <0.0000000000000002 ***
## Order 0.2742 0.0235 11.66 <0.0000000000000002 ***
## Habitable 0.3452 0.0192 17.97 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.635 on 1196 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.457
## F-statistic: 337 on 3 and 1196 DF, p-value: <0.0000000000000002
#Forward selection. Note: since we start from the full model, there is nothing left to add, so this just returns the full model. For true forward selection, start from the null model and supply a 'scope' argument.
step.forward <- stepAIC(full.model, direction = "forward", trace = FALSE)
summary(step.forward)
##
## Call:
## lm(formula = Preference ~ Natural + EdgeDensity + Order + Habitable,
## data = natprefdata_2[, 4:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0367 -0.3933 -0.0066 0.4430 1.6489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2751 0.1724 1.60 0.11
## Natural 0.4054 0.0155 26.10 <0.0000000000000002 ***
## EdgeDensity 0.6898 0.7375 0.94 0.35
## Order 0.2780 0.0239 11.65 <0.0000000000000002 ***
## Habitable 0.3423 0.0195 17.58 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.635 on 1195 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.457
## F-statistic: 253 on 4 and 1195 DF, p-value: <0.0000000000000002
#Backward selection
step.backward <- stepAIC(full.model, direction = "backward", trace = FALSE)
summary(step.backward)
##
## Call:
## lm(formula = Preference ~ Natural + Order + Habitable, data = natprefdata_2[,
## 4:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0660 -0.3913 -0.0141 0.4375 1.6624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3471 0.1542 2.25 0.025 *
## Natural 0.4093 0.0150 27.34 <0.0000000000000002 ***
## Order 0.2742 0.0235 11.66 <0.0000000000000002 ***
## Habitable 0.3452 0.0192 17.97 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.635 on 1196 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.457
## F-statistic: 337 on 3 and 1196 DF, p-value: <0.0000000000000002
What if our DV were binary? We would need to run logistic regression. More on how to interpret this and other types of regression next quarter!
natprefdata_2 <- natprefdata_2 %>%
mutate(Pref_binary = case_when(
Preference > median(Preference) ~ 1,
Preference < median(Preference) ~ 0
))
natprefdata_2 %>% count(Pref_binary)
## # A tibble: 2 x 2
## Pref_binary n
## <dbl> <int>
## 1 0 600
## 2 1 600
log_regression <- glm(Pref_binary ~ Natural + Order + Habitable + Duration, data = natprefdata_2, family = "binomial")
summary(log_regression)
##
## Call:
## glm(formula = Pref_binary ~ Natural + Order + Habitable + Duration,
## family = "binomial", data = natprefdata_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0493 -0.9198 0.0326 0.8787 2.6255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.6241 0.8388 -12.67 <0.0000000000000002 ***
## Natural 1.1210 0.0754 14.86 <0.0000000000000002 ***
## Order 0.1605 0.1008 1.59 0.11
## Habitable 1.1705 0.0869 13.47 <0.0000000000000002 ***
## Duration -0.1324 0.0826 -1.60 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1663.6 on 1199 degrees of freedom
## Residual deviance: 1300.4 on 1195 degrees of freedom
## AIC: 1310
##
## Number of Fisher Scoring iterations: 4
I had us calculate Z-scores manually above, but R has a built-in function for that. Z-scoring is a linear transformation - it doesn’t change the shape of your distribution. Sometimes you may also want to do non-linear transformations, which can help make skewed data more normal.
x <- rexp(100, rate = .1)
hist(x)
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.373 3.511 8.088 10.355 12.925 48.328
xZ <- scale(x)
summary(xZ)
## V1
## Min. :-1.060
## 1st Qu.:-0.727
## Median :-0.241
## Mean : 0.000
## 3rd Qu.: 0.273
## Max. : 4.034
hist(xZ)
#Note - if you have 0s, log1p(x) will accurately calculate log(x + 1) for all numbers
logx <- log(x)
hist(logx)
#You can also Z-score after log transformation to get all variables on a similar scale. Sometimes this makes interpretation more difficult, so I usually don't do this if the variable is my DV.
logxZ <- scale(logx)
hist(logxZ)
Interpreting betas after different transformations:
If DV log-transformed:
log(DV) = beta*IV
Do: (exp(beta) - 1)*100
This equals the percent change in DV for one unit change in IV
Example: beta = .198; (exp(.198) - 1)*100 ≈ 22% change in DV
If IV log-transformed
DV = beta*log(IV)
Do: beta/100
This equals the unit change in DV for a 1% change in IV
OR
Do: for an x% increase in the IV, multiply the coefficient by log(1 + x/100)
Example: beta = .198; for every 10% increase in the independent variable, the dependent variable increases by about 0.198 * log(1.10) ≈ 0.02 units
More info here: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/
If z-scored DV and IV
scaled(DV) = beta*scaled(IV)
A 1-SD change in the IV corresponds to a beta-SD change in the DV
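The arithmetic in the examples above, as a quick sketch (beta = .198 is just the running example value):

```r
beta <- 0.198

# DV log-transformed: percent change in DV per one-unit change in IV
(exp(beta) - 1) * 100   # about 21.9, i.e. roughly a 22% change

# IV log-transformed: unit change in DV for a 10% increase in IV
beta * log(1.10)        # about 0.019, i.e. roughly 0.02 units
```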
cor(natprefdata_2$Natural, natprefdata_2$Habitable)
## [1] -0.79465
cor(natprefdata_2$Natural, natprefdata_2$Habitable, method = "spearman")
## [1] -0.77314
cor.test(natprefdata_2$Natural, natprefdata_2$Habitable)
##
## Pearson's product-moment correlation
##
## data: natprefdata_2$Natural and natprefdata_2$Habitable
## t = -45.3, df = 1198, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.81461 -0.77281
## sample estimates:
## cor
## -0.79465
cor.test(natprefdata_2$Natural, natprefdata_2$Habitable, method = "spearman")
## Warning in cor.test.default(natprefdata_2$Natural, natprefdata_2$Habitable, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: natprefdata_2$Natural and natprefdata_2$Habitable
## S = 510662977, p-value <0.0000000000000002
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.77314
#library(ppcor)
#Regular correlation
cor.test(natprefdata_2$Natural, natprefdata_2$Preference)
##
## Pearson's product-moment correlation
##
## data: natprefdata_2$Natural and natprefdata_2$Preference
## t = 12, df = 1198, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.27601 0.37709
## sample estimates:
## cor
## 0.32749
#Partial correlation: between Naturalness and Preference controlling for Order
pcor.test(x=natprefdata_2$Natural, y=natprefdata_2$Preference, z=natprefdata_2$Order)
## estimate p.value statistic n gp Method
## 1 0.47812 1.6626e-69 18.834 1200 1 pearson
# Partial correlation between two variables controlling for all other variables in the data frame
pcor(natprefdata_2[,4:8])
## $estimate
## Natural Preference EdgeDensity Order Habitable
## Natural 1.000000 0.602566 0.195956 -0.051076 -0.82866
## Preference 0.602566 1.000000 0.027048 0.319396 0.45324
## EdgeDensity 0.195956 0.027048 1.000000 -0.169131 0.13390
## Order -0.051076 0.319396 -0.169131 1.000000 0.22524
## Habitable -0.828663 0.453243 0.133904 0.225242 1.00000
##
## $p.value
## Natural Preference EdgeDensity Order Habitable
## Natural 0.0000e+00 3.3170e-119 0.000000000007977 7.7327e-02 1.9766e-303
## Preference 3.3170e-119 0.0000e+00 0.349798656876870 8.5951e-30 1.0830e-61
## EdgeDensity 7.9770e-12 3.4980e-01 0.000000000000000 3.9111e-09 3.3373e-06
## Order 7.7327e-02 8.5951e-30 0.000000003911125 0.0000e+00 3.1134e-15
## Habitable 1.9766e-303 1.0830e-61 0.000003337257625 3.1134e-15 0.0000e+00
##
## $statistic
## Natural Preference EdgeDensity Order Habitable
## Natural 0.0000 26.10048 6.90789 -1.7679 -51.1763
## Preference 26.1005 0.00000 0.93535 11.6514 17.5772
## EdgeDensity 6.9079 0.93535 0.00000 -5.9321 4.6710
## Order -1.7679 11.65143 -5.93213 0.0000 7.9917
## Habitable -51.1763 17.57717 4.67097 7.9917 0.0000
##
## $n
## [1] 1200
##
## $gp
## [1] 3
##
## $method
## [1] "pearson"
# Semi-partial correlation: between naturalness and preference, controlling for order only in naturalness. Note the argument order: the control variable z is partialled out of the y-variable only
spcor.test(x=natprefdata_2$Preference, y=natprefdata_2$Natural, z=natprefdata_2$Order)
## estimate p.value statistic n gp Method
## 1 0.45169 2.4891e-61 17.516 1200 1 pearson
# This gets the semi-partial correlation between naturalness and preference while holding order constant for preference
spcor.test(x=natprefdata_2$Natural, y=natprefdata_2$Preference, z=natprefdata_2$Order)
## estimate p.value statistic n gp Method
## 1 0.45446 3.7282e-62 17.651 1200 1 pearson
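Both quantities can be computed by hand with residuals, which makes the difference concrete: a partial correlation removes the control variable from both variables, while a semi-partial removes it from only one. A self-contained sketch with simulated data (checked against the textbook formula, since ppcor does the same computation internally):

```r
set.seed(1)
z <- rnorm(200)
x <- z + rnorm(200)
y <- -z + rnorm(200)

# Partial correlation: residualize BOTH variables on z, then correlate
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Semi-partial (part) correlation: residualize only y on z
r_semipartial <- cor(x, resid(lm(y ~ z)))

# Check against the textbook formulas
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
r_partial_formula <- (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
all.equal(r_partial, r_partial_formula)   # TRUE
```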
Corrplot package - graphical representations of correlation matrices https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
library(corrplot)
source("cor.mtest.R")
cormat <- cor(natprefdata_2[,4:8])
cormat
## Natural Preference EdgeDensity Order Habitable
## Natural 1.00000 0.32749 0.26622 -0.31065 -0.79465
## Preference 0.32749 1.00000 0.11821 0.32787 0.06615
## EdgeDensity 0.26622 0.11821 1.00000 -0.18351 -0.15223
## Order -0.31065 0.32787 -0.18351 1.00000 0.49035
## Habitable -0.79465 0.06615 -0.15223 0.49035 1.00000
A <- cor.mtest(natprefdata_2[,4:8], .95)
corrplot(cormat, method="number", type = "lower", p.mat = A[[1]], sig.level = .001, tl.col = "black")
pairs(natprefdata_2[,4:8])
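The contents of cor.mtest.R aren't shown here, but a p-value matrix for corrplot's p.mat argument can be built by hand: loop over all variable pairs and run cor.test(). This is a minimal sketch of what such a helper might look like (the function name and exact behavior are assumptions, not the sourced file):

```r
# Build a symmetric matrix of correlation-test p-values for a numeric data frame
# (an assumed stand-in for the sourced cor.mtest.R helper)
cor_pmat <- function(df) {
  k <- ncol(df)
  p <- matrix(0, k, k, dimnames = list(names(df), names(df)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      p[i, j] <- p[j, i] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
  p
}

# Demo on a built-in data set; with the tutorial data it would be
# corrplot(cormat, p.mat = cor_pmat(natprefdata_2[, 4:8]), sig.level = .001)
cor_pmat(mtcars[, 1:4])
```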
Stargazer package - outputs tables of models for papers https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf
library(stargazer)
stargazer(multreg1, multreg_interaction1, type="text")
##
## ==============================================================================
## Dependent variable:
## -------------------------------------------------
## PreferenceZScore
## (1) (2)
## ------------------------------------------------------------------------------
## NaturalZScore 0.329*** 0.329***
## (0.027) (0.027)
##
## DurationZScore -0.041 -0.041
## (0.027) (0.027)
##
## NaturalZScore:DurationZScore -0.015
## (0.027)
##
## Constant -0.000 0.0005
## (0.027) (0.027)
##
## ------------------------------------------------------------------------------
## Observations 1,200 1,200
## R2 0.109 0.109
## Adjusted R2 0.107 0.107
## Residual Std. Error 0.945 (df = 1197) 0.945 (df = 1196)
## F Statistic 73.154*** (df = 2; 1197) 48.846*** (df = 3; 1196)
## ==============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(multreg1, multreg_interaction1, type="text", intercept.bottom = FALSE, intercept.top = TRUE, ci = TRUE, single.row = T)
##
## ==============================================================================
## Dependent variable:
## -------------------------------------------------
## PreferenceZScore
## (1) (2)
## ------------------------------------------------------------------------------
## Constant -0.000 (-0.053, 0.053) 0.0005 (-0.053, 0.054)
## NaturalZScore 0.329*** (0.275, 0.382) 0.329*** (0.275, 0.382)
## DurationZScore -0.041 (-0.094, 0.013) -0.041 (-0.095, 0.013)
## NaturalZScore:DurationZScore -0.015 (-0.068, 0.038)
## ------------------------------------------------------------------------------
## Observations 1,200 1,200
## R2 0.109 0.109
## Adjusted R2 0.107 0.107
## Residual Std. Error 0.945 (df = 1197) 0.945 (df = 1196)
## F Statistic 73.154*** (df = 2; 1197) 48.846*** (df = 3; 1196)
## ==============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Many other R packages are available for creating nicer tables as well, for example knitr (kable()), apaTables, and formattable.
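For instance, knitr::kable() turns any matrix or data frame, including a model's coefficient table, into a Markdown or LaTeX table in one call. A quick sketch with a toy model, just to have something to format:

```r
library(knitr)

# Toy model purely to produce a coefficient table
set.seed(1)
d <- data.frame(x = rnorm(30))
d$y <- 2 * d$x + rnorm(30)
m <- lm(y ~ x, data = d)

kable(coef(summary(m)), digits = 3)  # Markdown table of estimates, SEs, t, p
```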